Prediction of Eukaryotic Translation Initiation Sites Using Machine Learning

Author

  • Yoko Ishino
Abstract

The computational identification of translation initiation sites (TIS) is a key component of every gene prediction system and is therefore of major importance in genome annotation projects. Many machine learning methods have been described for identifying TIS, but most of them focus on transcripts such as mRNA, EST and cDNA sequences. Recognizing TIS in transcripts differs from recognizing TIS in genomic sequences for three main reasons. First, the transcription start sites, which are needed to predict TIS in transcripts using the well-known scanning models, are usually unknown in genomes. Second, transcripts typically contain at most a few TIS candidates, while eukaryotic genomes contain millions. Third, eukaryotic genomes contain introns, which disrupt the coding structure downstream of the TIS. This article focuses on the identification of TIS at the genomic level. Recently, proteogenomics, the integration of proteomics and genomics, has emerged, since the high-throughput identification of proteins and their accurate partial sequencing by tandem mass spectrometry combined with capillary liquid chromatography are now feasible for any cellular model at a full genomic scale. Proteogenomics can correct current gene annotations and makes it possible to identify novel genes. In other words, TIS prediction at the genomic level becomes simpler when mass spectrometry-based proteome data are used: TIS identification then only requires searching the restricted region upstream of the most upstream peptide used to identify an expressed protein. If distinct sequence information exists in the vicinity of the start methionine, the TIS can be predicted easily.
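
The restricted upstream search described above can be sketched as follows. This is only an illustration under stated assumptions, not the method used in the article: the function name `upstream_start_candidates`, the `max_upstream` limit, and the premise that the region upstream of the peptide is intron-free and in the same reading frame are all hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def upstream_start_candidates(seq, peptide_start, max_upstream=300):
    """Return 0-based positions of in-frame ATG codons upstream of a peptide.

    Assumptions (not from the source): `seq` is an intron-free DNA string,
    `peptide_start` is the index of the first codon of the most upstream
    identified peptide, and the search is capped at `max_upstream` bases.
    """
    seq = seq.upper()
    candidates = []
    limit = max(0, peptide_start - max_upstream)
    pos = peptide_start - 3
    # Walk upstream one codon at a time, staying in the peptide's frame.
    while pos >= limit:
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            # An in-frame stop codon closes the open reading frame.
            break
        if codon == "ATG":
            candidates.append(pos)
        pos -= 3
    return sorted(candidates)


if __name__ == "__main__":
    # Codons: ATG TAA GCC ATG AAA ATG CCC GGG; the peptide starts at CCC.
    example = "ATGTAAGCCATGAAAATGCCCGGG"
    print(upstream_start_candidates(example, 18))
```

Each position returned is a start methionine candidate; deciding which candidate is the true TIS is the classification problem addressed next.
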
To ascertain whether such distinct sequence information exists in the vicinity of the start methionine, we applied two machine learning methods, decision tree learning and support vector machines (SVM), to discriminate true from false start methionine codons.
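
As a rough illustration of how the sequence around a candidate start methionine might be presented to a decision tree or SVM learner, the sketch below one-hot encodes the flanking nucleotides. The abstract does not state the feature representation or window size, so the function name `encode_window`, the default window of 10 bases on each side, and the encoding itself are assumptions.

```python
NUCLEOTIDES = "ACGT"


def encode_window(seq, atg_pos, upstream=10, downstream=10):
    """One-hot encode the bases flanking the ATG codon at `atg_pos`.

    Returns a flat list of 4 * (upstream + downstream) integers. Positions
    that fall outside the sequence, or that hold an ambiguous base, are
    encoded as four zeros. The window size is an illustrative assumption.
    """
    seq = seq.upper()
    if seq[atg_pos:atg_pos + 3] != "ATG":
        raise ValueError("no ATG codon at the given position")
    # Offsets relative to the A of ATG; the codon itself is skipped
    # because it is identical for every candidate.
    offsets = list(range(-upstream, 0)) + list(range(3, 3 + downstream))
    features = []
    for offset in offsets:
        index = atg_pos + offset
        base = seq[index] if 0 <= index < len(seq) else "N"
        features.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return features


if __name__ == "__main__":
    # Three bases upstream (A, C, C) and one downstream (G) of the ATG.
    print(encode_window("GCCACCATGGCG", 6, upstream=3, downstream=1))
```

Vectors of this kind, labeled as true or false start codons, could then be passed to any standard decision tree or SVM implementation.
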

Similar resources

Translational Efficiency Prediction Using Structural Learning Method

It has been shown that sequence patterns, including Shine-Dalgarno sequences, start codons and the regions between them, affect translational efficiency. However, the relationships between these patterns and translational efficiency are complicated, and sophisticated methods are required to model them. In this work, we constructed a model to predict translational efficiencies of genes based on se...

StackTIS: A stacked generalization approach for effective prediction of translation initiation sites

The prediction of the translation initiation site in an mRNA or cDNA sequence is an essential step in gene prediction and an open research problem in bioinformatics. Although recent approaches perform well, more effective and reliable methodologies are solicited. We developed an adaptable data mining method, called StackTIS, which is modular and consists of three prediction components that are ...

Using feature generation and feature selection for accurate prediction of translation initiation sites.

Correct prediction of the translation initiation site (TIS) is an important issue in genomic research. We show that feature generation together with correlation based feature selection can be used with a variety of machine learning algorithms to give highly accurate translation initiation site prediction. Only very few features are needed and the results achieve comparable accuracy to the best ...

Protein Secondary Structure Prediction: a Literature Review with Focus on Machine Learning Approaches

A DNA sequence, which contains all genetic traits, is not itself a functional entity. Instead, it is converted into protein sequences through transcription and translation. The protein sequence later takes on a 3D structure, which is the functional unit and manages biological interactions using the information encoded in DNA. Every life process one can think of is undertaken by proteins with specific functio...

A Novel Data Mining Approach for the Accurate Prediction of Translation Initiation Sites

In an mRNA sequence, the prediction of the exact codon where the process of translation starts (Translation Initiation Site – TIS) is a particularly important problem. So far it has been tackled by several researchers who have applied various statistical and machine learning techniques, achieving high accuracy levels, often over 90%. In this paper we propose a machine learning approach that can further imp...

Publication date: 2009